The instructions below use general keywords for the columns in the Lightcast file. Check the data schema printed after the dataset-loading code above for the exact column names. For each visualization, customize colors, fonts, and styles (a 2.5-point deduction applies otherwise), and provide a two-sentence explanation of the key insights drawn from the graph.
Load the Raw Dataset:
- Use PySpark to read the 'lightcast_data.csv' file into a DataFrame.
- You may reuse your previous code.
- Copying code from your friend constitutes plagiarism. DO NOT DO THIS.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.functions import (
    col, split, explode, regexp_replace, transform, when,
    monotonically_increasing_id,
)
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio

np.random.seed(42)
pio.renderers.default = "notebook"

# Initialize Spark Session
spark = SparkSession.builder.appName("LightcastData").getOrCreate()

# Load Data
df = (
    spark.read.option("header", "true")
    .option("inferSchema", "true")
    .option("multiline", "true")
    .option("escape", "\"")
    .csv("./data/lightcast_job_postings.csv")
)

# Show Schema and Sample Data
# print("---This is Diagnostic check, No need to print it in the final doc---")
# df.printSchema()  # comment this line when rendering submission
# df.show(5)